Abstract 

Several indices that measure the degree of balance of a rooted phylogenetic tree 
have been proposed so far in the literature. In this work we define and study a 
new index of this kind, which we call the total cophenetic index: the sum, over 
all pairs of different leaves, of the depth of their least common ancestor. This 
index makes sense for arbitrary trees, can be computed in linear time, and 
has a larger range of values and a greater resolution power than other indices 
like Colless' or Sackin's. We compute its maximum and minimum values for 
arbitrary and binary trees, as well as exact formulas for its expected value for 
binary trees under the Yule and the uniform models of evolution. As a byproduct 
of this study, we obtain an exact formula for the expected value of the Sackin 
index under the uniform model, a result that seems to be new in the literature. 

Keywords: Phylogenetic tree, Imbalance index, Cophenetic value, Sackin 
index 



1. Introduction 

A phylogenetic tree is a representation of the shared evolutionary history 
of a set of extant species. From the mathematical point of view, it is a 
leaf-labeled rooted tree, with its leaves representing the extant species under study, 
its internal nodes representing common ancestors of some of them, the root 
representing the most recent common ancestor of all of them, and the arcs 
representing direct descendants through mutations. 

One of the most thoroughly studied shape properties of phylogenetic trees 
is their balance, that is, the degree to which the children of internal nodes tend 
to have the same number of descendant taxa. This global degree of balance 
of a tree is usually quantified by means of a single number generically called 
a balance index. The two most popular balance indices are Sackin's and 
Colless' [3] (see §2.2), but there are many more [3, Chap. 33], and Shao and 
Sokal [21, p. 1990] explicitly advise using more than one such index to quantify 
tree balance. 

Such balance indices only depend on the topology of the trees, not on the 
branch lengths or the actual taxa labeling their leaves. Since it is believed that 
the raw topology of a phylogenetic tree already reflects, at least to some extent, 
the evolutionary processes that have produced it [3 Chap. 33], these indices 
have also been widely used as tools to test stochastic models of evolution p^[2T]. 

Two of the most popular stochastic models of evolutionary tree growth are 
the Yule and the uniform models. The Yule, or Equal-Rate Markov model 
[SIHTIj starts with a single node and, at every step, a leaf is chosen randomly 
and uniformly, and it is replaced by a cherry, i.e., a phylogenetic tree consisting 
only of a root and two leaves. Finally, once the desired number of leaves is 
reached, the labels are assigned randomly and uniformly to the leaves. This 
corresponds to a model of evolution where, at each step, each currently extant 
species can give rise with the same probability to two new species. Under 
this model different trees with the same number of leaves may have different 
probabilities. In contrast, the main feature of the uniform, or Proportional to 
Distinguishable Arrangements model [19], is that all phylogenetic trees with the 
same number of leaves have the same probability. From the point of view of 
tree growth (51[53], this corresponds to a process where, starting with a node 
labeled 1, at the k-th step a new pendant arc, ending in the leaf labeled k + 1, 
is added either to a new root or to some edge (all possible locations of 
this new pendant arc being equiprobable). Notice that this is not an explicit model 
of evolution, only of tree growth. Several properties of the distributions of 
Sackin's and Colless' indices have been studied in the literature under these 
models H 1 H HOI [HI [ISl HZl [M 121] ■ 
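The Yule growth process described above can be sketched in a few lines of Python. This is only an illustration under assumptions that are not part of the paper: trees are stored as nested lists (a leaf is a one-element list holding its label, an internal node is a two-element list of children), and the function name `yule_tree` is hypothetical.

```python
import random

def yule_tree(n, rng=random):
    """Grow a binary tree with n leaves under the Yule (Equal-Rate Markov) model.

    Assumed representation (not from the paper): a leaf is [label],
    an internal node is [left_child, right_child].
    """
    root = [None]
    leaves = [root]
    while len(leaves) < n:
        # Choose a current leaf uniformly at random and replace it by a cherry.
        leaf = leaves.pop(rng.randrange(len(leaves)))
        left, right = [None], [None]
        leaf[:] = [left, right]          # the old leaf becomes an internal node
        leaves += [left, right]
    # Once the tree has n leaves, assign the labels 1..n uniformly at random.
    labels = list(range(1, n + 1))
    rng.shuffle(labels)
    for leaf, label in zip(leaves, labels):
        leaf[0] = label
    return root

if __name__ == "__main__":
    print(yule_tree(5, random.Random(2012)))
```

Passing a seeded `random.Random` instance makes a run reproducible; the uniform model would instead attach each new leaf to a uniformly chosen edge or to a new root.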

In this paper we propose a new balance index, the total cophenetic index. It 
is defined as the sum of the cophenetic values [22] of all pairs of different leaves. 
The main features of our index are that, unlike Colless' index, it makes sense 
for arbitrary (i.e., not necessarily fully resolved) trees; as Colless' and Sackin's 
indices, it can be easily computed in linear time; its range of values is larger 
than Colless' and Sackin's (up to $O(n^3)$, instead of $O(n^2)$), and it has a greater 
resolution power than those indices. 

We compute the maximum and minimum values of our index, both in the 
arbitrary and the binary cases, and explicit formulas for its average value under 
the Yule and the uniform models for binary trees. We actually deduce its average 
value under the uniform model from an explicit formula for the average value 
of the Sackin index. This average value was known until now only for its limit 
distribution [3], and our formula seems thus to be new in the literature. 

The rest of this paper is organized as follows. In Section 2 we introduce 
the basic notations and facts on phylogenetic trees that will be used henceforth, 
and we recall some basic facts on the Sackin and the Colless indices. Then, 
in Section 3, we define our total cophenetic index $\Phi$ and we establish its basic 
properties. In Section 4 we compute its maximum and minimum values, and 
then, in subsequent sections, we compute its expected value under the Yule and 
the uniform models. We finally devote a last section to conclusions and the 
discussion of two preliminary numerical experiments involving $\Phi$. 



2. Preliminaries 

2.1. Phylogenetic trees 

In this paper, by a phylogenetic tree on a set $S$ of taxa we mean a rooted 
tree with its leaves bijectively labeled in the set $S$. To simplify the language, 
we shall always identify a leaf of a phylogenetic tree with its label. We shall use 
the term phylogenetic tree with $n$ leaves to refer to a phylogenetic tree on the 
set $\{1, \ldots, n\}$. We shall denote by $L(T)$ the set of leaves of a phylogenetic tree 
$T$ and by $V_{int}(T)$ its set of internal nodes. 

A phylogenetic tree is binary, or fully resolved, when all its internal nodes 
are bifurcating, that is, when every internal node has exactly two children. 

Whenever there exists a path from $u$ to $v$ in a phylogenetic tree $T$, we shall 
say that $v$ is a descendant of $u$ and also that $u$ is an ancestor of $v$. The cluster 
of a node $v$ in $T$ is the set $C_T(v)$ of its descendant leaves, and we shall denote by 
$\kappa_T(v)$ the cardinal $|C_T(v)|$, that is, the number of descendant leaves of $v$. 

Given a node $v$ of a phylogenetic tree $T$, the subtree of $T$ rooted at $v$ is the 
subgraph of $T$ induced on the set of descendants of $v$. It is a phylogenetic tree 
on $C_T(v)$ with root this node $v$. 

The lowest common ancestor (LCA) of a pair of nodes $u, v$ of a phylogenetic 
tree $T$, in symbols $\mathrm{LCA}_T(u, v)$, is the unique common ancestor of them that is 
a descendant of every other common ancestor of them. 

The depth $\delta_T(v)$ of a node $v$ in a phylogenetic tree $T$ is the length (in number 
of arcs) of the unique path from the root $r$ to $v$. 

A rooted caterpillar is a binary phylogenetic tree all of whose internal nodes 
have a leaf child: see Fig. 1(a). A rooted star is a phylogenetic tree such that 
all its leaves have depth 1: see Fig. 1(b). 



Figure 1: (a) A rooted caterpillar with n leaves, (b) The rooted star with n leaves. 

Let $T$ be a binary phylogenetic tree. For every $v \in V_{int}(T)$, say with children 
$v_1, v_2$, the balance value of $v$ is $\mathrm{bal}_T(v) = |\kappa_T(v_1) - \kappa_T(v_2)|$. An internal node 
$v$ of $T$ is balanced when $\mathrm{bal}_T(v) \leq 1$. So, a node $v$ with children $v_1$ and $v_2$ is 
balanced if, and only if, $\{\kappa_T(v_1), \kappa_T(v_2)\} = \{\lfloor \kappa_T(v)/2 \rfloor, \lceil \kappa_T(v)/2 \rceil\}$. 

We shall say that a binary phylogenetic tree $T$ is maximally balanced when 
all its internal nodes are balanced. Recursively, a binary phylogenetic tree is 
maximally balanced when its root is balanced and both subtrees rooted at the 
children of the root are maximally balanced. Notice that, for any number $n$ 
of leaves, the topology of a maximally balanced tree with $n$ leaves is fixed, 
and therefore two maximally balanced trees with the same number of leaves 
differ only in their labeling. Fig. 2 depicts the maximally balanced trees with 
$n = 2, \ldots, 6$ leaves, up to relabelings. 




Figure 2: Maximally balanced trees. 

Let $\mathcal{T}_n$ (resp., $\mathcal{BT}_n$) be the set of isomorphism classes of phylogenetic trees 
(resp., binary phylogenetic trees) with $n$ leaves. It is well known [7, Ch. 3] that 
$|\mathcal{BT}_1| = 1$ and, for every $n \geq 2$, 

\[ |\mathcal{BT}_n| = (2n-3)!! = (2n-3)(2n-5) \cdots 3 \cdot 1. \]

No closed formula is known for the cardinal $|\mathcal{T}_n|$, only recurrences or generating 
functions (see again [7, Ch. 3] and the references therein). 

An ordered $m$-forest on a set $S$ is an ordered sequence of $m$ phylogenetic 
trees $(T_1, T_2, \ldots, T_m)$, each $T_i$ on a set $S_i$ of taxa, such that these sets $S_i$ are 
pairwise disjoint and their union is $S$. An ordered forest is binary when it 
consists of binary trees. Let $\mathcal{F}_{m,n}$ (resp., $\mathcal{BF}_{m,n}$) be the set of isomorphism 
classes of ordered $m$-forests (resp., binary ordered $m$-forests) on a set $S$ with 
$|S| = n$. It is known that, for every $n \geq m \geq 1$, 

\[ |\mathcal{BF}_{m,n}| = \frac{(2n-m-1)!\, m}{(n-m)!\, 2^{n-m}}. \]

Again, no closed formula is known for $|\mathcal{F}_{m,n}|$. 

2.2. Balance indices 

Several balance indices have been proposed so far in the literature [3 p. 563]. 
The two most popular ones are the Sackin index [5D] and the Colless index [B]. 
The Sackin index of a phylogenetic tree $T \in \mathcal{T}_n$ is defined as the sum of the 
depths of its leaves: 

\[ S(T) = \sum_{i=1}^{n} \delta_T(i). \]

Alternatively [2], 

\[ S(T) = \sum_{v \in V_{int}(T)} \kappa_T(v). \]

On the other hand, the Colless index of a binary phylogenetic tree $T$ is defined 
as 

\[ C(T) = \sum_{v \in V_{int}(T)} \mathrm{bal}_T(v). \]

This Colless index has been extended to non-binary trees by defining $\mathrm{bal}_T(v) = 0$ 
for every non-bifurcating internal node [21]. 

It is straightforward to notice that these two indices depend only on the 
topology of the tree, and they are invariant under isomorphisms and relabelings 
of leaves. This is desirable in a balance index, because the degree of symmetry 
of a tree depends only on its shape. 

Both Sackin's and Colless' indices reach their maximum value exactly at 
caterpillars, which are clearly the most imbalanced trees, and they reach their 
minimum on $\mathcal{BT}_n$ at the maximally balanced trees. In both cases, the 
maximum value is in $O(n^2)$. But they may also reach their minimum on $\mathcal{BT}_n$ at 
other trees. For instance, for $n = 6$, both indices take their minimum value at 
the two trees $T, T'$ depicted in Fig. 3: $T'$ is maximally balanced, but $T$ is not. 
Actually, it is easy to check that Sackin's index is invariant under interchanges 
of cousins, which may produce trees with different degrees of symmetry but the 
same Sackin index. 




Figure 3: Two trees $T$ and $T'$ having minimum Sackin and Colless indices on $\mathcal{BT}_6$: 
$S(T) = S(T') = 16$ and $C(T) = C(T') = 2$. 
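The two classical indices are easy to evaluate directly from their definitions. The sketch below is an illustration only, under assumptions of ours: trees are nested Python tuples (a leaf is an integer, an internal node is a tuple of children), the function names are hypothetical, and the two 6-leaf shapes are merely one choice consistent with the values quoted in the caption of Fig. 3.

```python
def leaf_count(tree):
    """Number of descendant leaves (kappa) of the root of `tree`."""
    if not isinstance(tree, tuple):
        return 1
    return sum(leaf_count(child) for child in tree)

def sackin(tree, depth=0):
    """Sackin index: sum of the depths of all leaves."""
    if not isinstance(tree, tuple):
        return depth
    return sum(sackin(child, depth + 1) for child in tree)

def colless(tree):
    """Colless index: sum of |kappa(v1) - kappa(v2)| over bifurcating nodes.

    Non-bifurcating internal nodes contribute 0, as in the extension
    to non-binary trees mentioned in the text.
    """
    if not isinstance(tree, tuple):
        return 0
    own = abs(leaf_count(tree[0]) - leaf_count(tree[1])) if len(tree) == 2 else 0
    return own + sum(colless(child) for child in tree)

if __name__ == "__main__":
    maximally_balanced = ((1, (2, 3)), (4, (5, 6)))   # split 3 + 3 at the root
    other_shape = (((1, 2), (3, 4)), (5, 6))          # split 4 + 2 at the root
    for t in (maximally_balanced, other_shape):
        print(sackin(t), colless(t))
```

Calling `leaf_count` inside `colless` makes this version quadratic in the worst case; it is kept that way for readability rather than speed.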

The main drawback of Colless' index is that it is difficult to generalize 
meaningfully to non-binary trees. Moreover, as Fig. 3 shows, although not every 
interchange of cousins yields trees with the same Colless index, there are still 
interchanges of cousins that modify the symmetry of the trees but preserve this 
index. 

The expected values of these indices on $\mathcal{BT}_n$ have been studied under the 
Yule and the uniform models. Recall that, under the Yule model, different trees 
in $\mathcal{BT}_n$ may have different probabilities: namely, a tree $T$ with $n$ leaves has 
probability 

\[ P_Y(T) = \frac{2^{n-1}}{n!} \prod_{v \in V_{int}(T)} \frac{1}{\kappa_T(v) - 1}. \]

Under the uniform model, all trees in $\mathcal{BT}_n$ are equiprobable, and thus they have 
probability $1/(2n-3)!!$. 

Let $C_n$ and $S_n$ be the random variables defined by choosing a tree $T \in \mathcal{BT}_n$ 
and computing $C(T)$ or $S(T)$, respectively. The following facts are known about 
the expected values of these random variables: 

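• Under the Yule model, 

  - $E_Y(C_n) = (n \bmod 2) + n \sum_{j=2}^{\lfloor n/2 \rfloor} 1/j$. 

  - $E_Y(S_n) = 2n \sum_{j=2}^{n} 1/j$ [10]. 

• Under the uniform model, 

  $E_U(C_n), E_U(S_n) \sim \sqrt{\pi}\, n^{3/2}$ [3]. 

We shall actually prove in this paper (see Theorem 22) that 

\[ E_U(S_n) = \frac{n}{2n-3}\, {}_3F_2\!\left( \begin{matrix} 2,\ 2,\ 2-n \\ 1,\ 4-2n \end{matrix} ;\ 2 \right), \]

where ${}_3F_2$ is a hypergeometric function [1]. 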



3. The total cophenetic index 

For every pair of leaves $i, j$ in a phylogenetic tree $T$, their cophenetic value 
is the depth of their least common ancestor: 

\[ \varphi_T(i, j) = \delta_T(\mathrm{LCA}_T(i, j)). \]

Definition 1. The total cophenetic index of a phylogenetic tree $T \in \mathcal{T}_n$ is the 
sum of the cophenetic values of its pairs of different leaves: 

\[ \Phi(T) = \sum_{1 \leq i < j \leq n} \varphi_T(i, j). \]

This index can be seen as an extension of Sackin's: instead of adding up the 
depths of the leaves (that is, the depths of the LCA of every leaf and itself), 
$\Phi(T)$ adds up the depths of the LCA of every pair of different leaves in $T$. Notice also 
that, like Sackin's and Colless' indices, $\Phi(T)$ only depends on the topology of $T$, 
and in particular it is invariant under permutations of its labels. 

Fig. 4 shows all possible topologies of phylogenetic trees with 5 leaves, and 
their total cophenetic indices. Although we shall return to this point later for trees with 
an arbitrary number $n$ of leaves, notice that the rooted star has the smallest 
total cophenetic index, 0; the binary tree with the smallest total cophenetic 
index is the maximally balanced one; and the tree with the largest total cophenetic 
index is the caterpillar. 

The following alternative expression for $\Phi(T)$ will be useful in many proofs. 
Figure 4: All phylogenetic trees with 5 leaves, up to relabelings, and their total cophenetic 
index. 



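Lemma 2. Let $T \in \mathcal{T}_n$ be a phylogenetic tree with root $r$. Then, 

\[ \Phi(T) = \sum_{v \in V_{int}(T) - \{r\}} \binom{\kappa_T(v)}{2}. \]

Proof. For every $v \in V_{int}(T) - \{r\}$ and for every $i, j \in L(T)$, let 

\[ I_v(i, j) = \begin{cases} 1 & \text{if } i, j \in C_T(v) \\ 0 & \text{otherwise} \end{cases} \]

Then, $\varphi_T(i, j) = \sum_{v \in V_{int}(T) - \{r\}} I_v(i, j)$ and thus 

\[ \Phi(T) = \sum_{1 \leq i < j \leq n} \varphi_T(i, j) = \sum_{v \in V_{int}(T) - \{r\}} \sum_{1 \leq i < j \leq n} I_v(i, j) = \sum_{v \in V_{int}(T) - \{r\}} \binom{\kappa_T(v)}{2}. \]

□ 

Corollary 3. For every $T \in \mathcal{T}_n$, $\Phi(T)$ can be computed in time $O(n)$. 

Proof. The vector $(\kappa_T(v))_{v \in V_{int}(T) - \{r\}}$ can be computed in linear time by traversing 
the tree $T$ in post order [26, §3.2], and then, by the last lemma, $\Phi(T)$ is 
computed in linear time from this vector. □ 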
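The post-order computation behind Corollary 3 can be sketched as follows. This is a minimal illustration, not the authors' implementation: the nested-tuple tree encoding (a leaf is an integer, an internal node is a tuple of children) and the function names are assumptions of ours. A second, quadratic function evaluates Definition 1 directly, so that both routes can be compared on small trees.

```python
from itertools import combinations
from math import comb

def total_cophenetic(tree):
    """Phi(T) via Lemma 2: sum of C(kappa(v), 2) over non-root internal nodes.

    One post-order pass; each node is visited once.
    """
    total = 0

    def visit(node, is_root):
        nonlocal total
        if not isinstance(node, tuple):
            return 1                      # a leaf has exactly one descendant leaf
        kappa = sum(visit(child, False) for child in node)
        if not is_root:
            total += comb(kappa, 2)
        return kappa

    visit(tree, True)
    return total

def total_cophenetic_by_definition(tree):
    """Phi(T) from Definition 1: sum of LCA depths over pairs of leaves."""
    paths = {}                            # leaf label -> child indices from the root

    def walk(node, path):
        if not isinstance(node, tuple):
            paths[node] = path
        else:
            for index, child in enumerate(node):
                walk(child, path + (index,))

    walk(tree, ())

    def lca_depth(p, q):
        # The LCA depth is the length of the common prefix of both root paths.
        depth = 0
        for a, b in zip(p, q):
            if a != b:
                break
            depth += 1
        return depth

    return sum(lca_depth(paths[i], paths[j]) for i, j in combinations(paths, 2))

if __name__ == "__main__":
    caterpillar = ((((1, 2), 3), 4), 5)
    star = (1, 2, 3, 4, 5)
    print(total_cophenetic(caterpillar), total_cophenetic(star))
```

The recursive traversal is fine for moderate sizes; for very deep trees an explicit stack would avoid Python's recursion limit.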
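Lemma 4. Let $T \in \mathcal{T}_n$ be a phylogenetic tree with root $r$, and let $T_1, \ldots, T_k$, 
$k \geq 2$, be the subtrees rooted at the children of $r$; cf. Fig. 5. Then, 

\[ \Phi(T) = \sum_{i=1}^{k} \Phi(T_i) + \sum_{i=1}^{k} \binom{|L(T_i)|}{2}. \]

Figure 5: 

Proof. Let $z_i$ be the root of $T_i$, $i = 1, \ldots, k$, and $r$ the root of $T$. Then, by 
Lemma 2, 

\[ \Phi(T) = \sum_{v \in V_{int}(T) - \{r\}} \binom{\kappa_T(v)}{2} = \sum_{i=1}^{k} \left( \binom{\kappa_T(z_i)}{2} + \sum_{v \in V_{int}(T_i) - \{z_i\}} \binom{\kappa_{T_i}(v)}{2} \right) = \sum_{i=1}^{k} \left( \binom{|L(T_i)|}{2} + \Phi(T_i) \right). \]

□ 

This shows that the total cophenetic index is a recursive tree shape statistic 
in the sense of [11]. 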

The next lemma shows that the total cophenetic index is local, in the sense that 
if two trees differ only on a rooted subtree, then the difference between their 
total cophenetic values is equal to that of these subtrees. Sackin's and Colless' 
indices also satisfy this property. 

Lemma 5. Let $T_0$ and $T'_0$ be two phylogenetic trees with $L(T_0) = L(T'_0) \subseteq 
\{1, \ldots, n\}$, let $T \in \mathcal{T}_n$ be such that its subtree rooted at some node $z$ is $T_0$, and 
let $T' \in \mathcal{T}_n$ be the tree obtained from $T$ by replacing $T_0$ by $T'_0$ as its subtree rooted 
at $z$. Then 

\[ \Phi(T) - \Phi(T') = \Phi(T_0) - \Phi(T'_0). \]

Proof. Without any loss of generality, assume that $L(T_0) = L(T'_0) = \{1, \ldots, m\}$, 
with $m \leq n$. Let $k = \delta_T(z) = \delta_{T'}(z)$. Then, for every $i, j \leq m$, 

\[ \varphi_T(i, j) = k + \varphi_{T_0}(i, j), \qquad \varphi_{T'}(i, j) = k + \varphi_{T'_0}(i, j). \]

On the other hand, $\varphi_T(i, j) = \varphi_{T'}(i, j)$ if $i > m$ or $j > m$. Therefore 

\[ \Phi(T) - \Phi(T') = \sum_{1 \leq i < j \leq m} \big( \varphi_T(i, j) - \varphi_{T'}(i, j) \big) = \sum_{1 \leq i < j \leq m} \big( \varphi_{T_0}(i, j) - \varphi_{T'_0}(i, j) \big) = \Phi(T_0) - \Phi(T'_0). \]

□ 

The nodal distance $d_T(i, j)$ between a pair of leaves $i, j$ is the length of 
the unique undirected path connecting them; equivalently, it is the sum of the 
lengths of the paths from $\mathrm{LCA}_T(i, j)$ to $i$ and $j$. The total area [17] of a tree 
$T \in \mathcal{T}_n$ is defined as 

\[ D(T) = \sum_{1 \leq i < j \leq n} d_T(i, j). \]

There is an easy relation between $\Phi(T)$, $S(T)$ and $D(T)$, which will be used 
several times in this paper. 

Lemma 6. For every $T \in \mathcal{T}_n$, 

\[ (n-1) S(T) = 2\Phi(T) + D(T). \]

Proof. It is straightforward to check that, for every $i, j \in L(T)$, 

\[ \delta_T(i) + \delta_T(j) = d_T(i, j) + 2\varphi_T(i, j). \]

Therefore, 

\[ 2\Phi(T) + D(T) = \sum_{1 \leq i < j \leq n} \big( 2\varphi_T(i, j) + d_T(i, j) \big) = \sum_{1 \leq i < j \leq n} \big( \delta_T(i) + \delta_T(j) \big) = (n-1) \sum_{i=1}^{n} \delta_T(i) = (n-1) S(T). \]

□ 

4. Trees with maximum and minimum $\Phi$ 

In this section we determine which trees in $\mathcal{T}_n$ and $\mathcal{BT}_n$ have the largest and 
smallest total cophenetic indices. We begin by establishing two lemmas that 
will allow us to find the trees with the maximum $\Phi$ on $\mathcal{T}_n$. 

Lemma 7. Let $T_1, \ldots, T_k$, with $k \geq 3$, be an ordered forest on $\{1, \ldots, m\}$. 
Consider the trees $T_0, T'_0 \in \mathcal{T}_m$ described in Fig. 6. Then, $\Phi(T'_0) - \Phi(T_0) > 0$. 
Figure 6: The trees $T_0$ and $T'_0$ in the statement of Lemma 7. 

Proof. With the notations of Fig. 6, notice that 

\[ \Phi(T'_0) - \Phi(T_0) = \sum_{v \in V_{int}(T'_0) - \{r\}} \binom{\kappa_{T'_0}(v)}{2} - \sum_{v \in V_{int}(T_0) - \{r\}} \binom{\kappa_{T_0}(v)}{2} > 0. \]

□ 



Corollary 8. For every non-binary phylogenetic tree $T \in \mathcal{T}_n$, there always 
exists a binary phylogenetic tree $T'$ such that $\Phi(T') > \Phi(T)$. 

Proof. Let $T \in \mathcal{T}_n$ be a non-binary phylogenetic tree. Then it contains an 
internal node $z$ whose rooted subtree looks like the tree $T_0$ in the previous 
lemma, for some $k \geq 3$. By Lemma 5 and the last lemma, if $T' \in \mathcal{T}_n$ is the 
tree obtained from $T$ by replacing $T_0$ by $T'_0$ as its subtree rooted at $z$, then 
$\Phi(T') - \Phi(T) > 0$. Repeating this step as long as some internal node is not 
bifurcating eventually yields a binary tree. □ 

Therefore, the maximum total cophenetic index is reached at a binary tree. 

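Lemma 9. Let $m \geq 4$, let $2 \leq k \leq m-2$, let $T_1$ be any binary tree on 
$\{k+1, \ldots, m\}$, and let $T_0$ and $T'_0$ be the phylogenetic trees in $\mathcal{BT}_m$ depicted in 
Fig. 7. Then, $\Phi(T'_0) - \Phi(T_0) > 0$. 

Figure 7: The trees $T_0$ and $T'_0$ in the statement of Lemma 9. 

Proof. By Lemma 2 and recalling that $|L(T_1)| = m-k$, we have that 

\[ \Phi(T_0) = \Phi(T_1) + \binom{m-k}{2} + \binom{k}{2} + \binom{k-1}{2} + \cdots + \binom{3}{2} + \binom{2}{2}, \]
\[ \Phi(T'_0) = \Phi(T_1) + \binom{m-k}{2} + \binom{m-k+1}{2} + \cdots + \binom{m-1}{2}, \]

and hence, since $m-k \geq 2$, 

\[ \Phi(T'_0) - \Phi(T_0) = \sum_{j=2}^{k} \left( \binom{m-k+j-1}{2} - \binom{j}{2} \right) > 0. \]

□ 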



Proposition 10. The trees in $\mathcal{T}_n$ with maximum total cophenetic index are 
exactly the rooted caterpillars $K_n$, and this maximum is $\Phi(K_n) = \binom{n}{3}$. 

Proof. By Corollary 8, any tree in $\mathcal{T}_n$ with maximum total cophenetic index will 
be binary. Let now $T \in \mathcal{BT}_n$ and assume that it is not a caterpillar. Therefore, 
it has an internal node $z$ of largest depth without any leaf child; in particular, 
all internal descendant nodes of $z$ have some leaf child. Thus, and up to a 
relabeling of its leaves, the subtree of $T$ rooted at $z$ has the form of the tree $T_0$ 
in Fig. 8, for some $k \geq 2$ and some $l \geq k+2$. But then, by Lemma 9 (taking 
as $T_1$ the caterpillar subtree rooted at the parent of the leaf $k$), the tree $T'_0$ also 
depicted in Fig. 8 has a strictly larger total cophenetic index. Then, by Lemma 
5, if we replace in $T$ the subtree rooted at $z$ by this tree $T'_0$, we obtain a new 
tree $T'$ with $\Phi(T') > \Phi(T)$. This implies that no tree other than a caterpillar 
can have the largest total cophenetic index. 

Figure 8: The trees $T_0$ and $T'_0$ in the proof of Proposition 10. 

As far as the total cophenetic index of the rooted caterpillar $K_n$ with $n$ leaves 
depicted in Fig. 1(a) goes, since the parent of the leaf labelled $j$, for $j = 2, \ldots, n$, 
has $j$ descendant leaves, by Lemma 2 we have that 

\[ \Phi(K_n) = \sum_{j=2}^{n-1} \binom{j}{2} = \binom{n}{3}. \]

□ 

It is obvious that the minimum total cophenetic index is 0, and it is attained 
only at the rooted star trees, depicted in Fig. 1(b). Therefore, the range of 
$\Phi$ on $\mathcal{T}_n$ goes from 0 to $\binom{n}{3}$. This is one order of magnitude larger than the 
range of Sackin's and Colless' indices, whose maximum value, reached also at 
the rooted caterpillars, has order $O(n^2)$ [9, 18, 21]. 
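The difference in growth rates can be seen numerically on caterpillars. The following sketch is an illustration under assumptions of ours (nested-tuple trees, hypothetical function names): it builds the rooted caterpillar with n leaves and computes both the total cophenetic index (via Lemma 2) and the Sackin index (as the sum of the cluster sizes of all internal nodes) in a single post-order pass.

```python
from math import comb

def caterpillar(n):
    """Rooted caterpillar with leaves 1..n (n >= 2), as nested tuples."""
    tree = (1, 2)
    for leaf in range(3, n + 1):
        tree = (tree, leaf)
    return tree

def phi_and_sackin(tree):
    """Return (Phi(T), S(T)) from the cluster sizes of the internal nodes."""
    phi = 0
    sackin = 0

    def visit(node, is_root):
        nonlocal phi, sackin
        if not isinstance(node, tuple):
            return 1
        kappa = sum(visit(child, False) for child in node)
        sackin += kappa                  # every internal node counts for Sackin
        if not is_root:
            phi += comb(kappa, 2)        # the root is excluded in Lemma 2
        return kappa

    visit(tree, True)
    return phi, sackin

if __name__ == "__main__":
    for n in (5, 10, 20, 40):
        phi, s = phi_and_sackin(caterpillar(n))
        print(n, phi, comb(n, 3), s)
```

For each n the second and third printed columns coincide, as Proposition 10 states, while the Sackin column grows only quadratically.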

Let us characterize now those binary phylogenetic trees with smallest total 
cophenetic index. 

Lemma 11. Let $T_1, T_2, T_3, T_4$ be an ordered binary forest on $\{1, \ldots, m\}$, let 
$x_i = |L(T_i)|$, for $i = 1, 2, 3, 4$, and assume that $x_1 \geq x_2$, $x_3 \geq x_4$ and $x_1 > x_3$. 
Let $T_0$ be the phylogenetic tree depicted in Fig. 9(a), and let $T \in \mathcal{BT}_n$ ($n \geq m$) be 
a binary phylogenetic tree having $T_0$ as a subtree rooted at some node. If $\Phi(T)$ 
is minimum in $\mathcal{BT}_n$, then $x_4 \geq x_2$. 



'T4 



'T4 



'T2 



(a) To 



Figure 9: (a) The tree To in the statement of Lemma (b) The tree Tq in the proof of 
Lemma 1111 

Proof. Assume that $x_2 > x_4$. We shall show that, in this case, a suitable 
interchange of cousins in $T_0$ produces a tree with smaller total cophenetic index, 
which in particular will imply that $\Phi(T)$ cannot be the minimum in $\mathcal{BT}_n$. 

Assume that the tree $T$ in the statement has the subtree $T_0$ rooted at a node 
$z$. Consider the tree $T'_0$ obtained by interchanging in $T_0$ the subtrees $T_2$ and $T_4$ 
(see Fig. 9(b)) and let $T'$ be the tree obtained from $T$ by replacing $T_0$ by $T'_0$ 
as its subtree rooted at $z$. Then, by Lemmas 5 and 2, 

\[ \Phi(T') - \Phi(T) = \Phi(T'_0) - \Phi(T_0) = \binom{x_1 + x_4}{2} + \binom{x_2 + x_3}{2} - \binom{x_1 + x_2}{2} - \binom{x_3 + x_4}{2} \]
\[ = x_1 x_4 + x_2 x_3 - x_1 x_2 - x_3 x_4 = (x_1 - x_3)(x_4 - x_2) < 0, \]

which shows that $\Phi(T') < \Phi(T)$. □ 

From the proof of the last lemma we deduce that if, in the tree $T_0$ in 
Fig. 9(a), $|L(T_1)| \neq |L(T_3)|$ and $|L(T_2)| \neq |L(T_4)|$, and if we interchange $T_2$ 
and $T_4$, then the resulting tree always has a different total cophenetic index. 

Lemma 12. Let $T_1, T_2$ be an ordered binary forest on $\{1, \ldots, m-1\}$, let $x_i = 
|L(T_i)|$, for $i = 1, 2$, and assume that $x_1 \geq x_2$. Let $T_0$ be the phylogenetic tree 
depicted in Fig. 10(a), and let $T \in \mathcal{BT}_n$ be a binary phylogenetic tree having $T_0$ 
as a subtree rooted at some node. If $\Phi(T)$ is minimum in $\mathcal{BT}_n$, then $x_1 = x_2 = 1$. 
Figure 10: (a) The tree $T_0$ in the statement of Lemma 12. (b) The tree $T'_0$ in the proof of 
Lemma 12. 



Proof. Assume that $x_1 > 1$. We shall show that, again in this case, a suitable 
interchange of cousins in $T_0$ produces a tree with smaller total cophenetic index. 

Assume that the tree $T$ in the statement has the subtree $T_0$ rooted at a 
node $z$. Let $T'$ be the tree obtained from $T$ by replacing $T_0$ by the subtree $T'_0$ 
described in Fig. 10(b). Then: 

\[ \Phi(T') - \Phi(T) = \Phi(T'_0) - \Phi(T_0) < 0, \]

which shows that $\Phi(T') < \Phi(T)$. □ 

The last two lemmas show that, unlike what happens with Sackin's and 
Colless' indices, any interchange of cousins that changes the balance of their 
grandparent always changes the total cophenetic index of a tree. 

Theorem 13. For every $T \in \mathcal{BT}_n$, $\Phi(T)$ is minimum on $\mathcal{BT}_n$ if, and only if, 
$T$ is maximally balanced. 

Proof. Assume that $T \in \mathcal{BT}_n$ is not maximally balanced, and let $z$ be a non-balanced 
internal node in $T$ with largest depth. Assume that $a$ and $b$ are its 
children, with $\kappa_T(a) \geq \kappa_T(b) + 2$. 

If $b$ is a leaf, then, by Lemma 12, $\kappa_T(a) = 2$ and therefore $\kappa_T(a) \not\geq 3 = 
\kappa_T(b) + 2$. Therefore, $a$ and $b$ are internal, and hence balanced. Let $T_0$ be the 
subtree of $T$ rooted at $z$, represented in Fig. 9(a), and let $x_i = |L(T_i)|$, for 
$i = 1, 2, 3, 4$; without any loss of generality, we shall assume that $x_1 \geq x_2$ and 
$x_3 \geq x_4$ and thus, since $a$ and $b$ are balanced, $x_2 = x_1$ or $x_1 - 1$ and $x_4 = x_3$ 
or $x_3 - 1$. Then, $x_1 + x_2 = \kappa_T(a) \geq \kappa_T(b) + 2 = x_3 + x_4 + 2$ implies that 
$2x_1 \geq 2x_3 + 1$, and hence that $x_1 > x_3$. 

Therefore, by Lemma 11, if $\Phi(T)$ is minimum in $\mathcal{BT}_n$, it must happen that 
$x_1 > x_3 \geq x_4 \geq x_2$. Since it forbids the equality $x_1 = x_2$, it implies that 
$x_1 = x_2 + 1$ and therefore $x_2 = x_3 = x_4$. But then $x_1 + x_2 = 2x_2 + 1 \not\geq 
x_3 + x_4 + 2 = 2x_2 + 2$, against the assumption that $z$ is not balanced. □ 

So, the only binary trees with minimum $\Phi$ are the maximally balanced ones. Let 
us now compute this minimum value of $\Phi$ on $\mathcal{BT}_n$. 
Lemma 14. For every $n$, let $f(n)$ be the minimum of $\Phi$ on $\mathcal{BT}_n$. Then, $f(1) = 
f(2) = 0$ and 

\[ f(n) = f(\lceil n/2 \rceil) + f(\lfloor n/2 \rfloor) + \binom{\lceil n/2 \rceil}{2} + \binom{\lfloor n/2 \rfloor}{2}, \quad \text{for } n \geq 3. \]

Proof. This recurrence for $f(n)$ is a direct consequence of Lemma 4 and the fact 
that the root of a maximally balanced tree in $\mathcal{BT}_n$ is balanced and the subtrees 
rooted at its children are maximally balanced. □ 

Proposition 15. For every $n \geq 0$, let $a(n)$ be the exponent of the highest power of 2 that divides 
$n!$. Then, for every $n \geq 1$, 

\[ f(n) = \sum_{k=0}^{n-1} a(k). \]

Proof. The sequence $(a(n))_n$ is sequence A011371 in Sloane's On-Line Encyclopedia 
of Integer Sequences [25], where we learn that it satisfies the recurrence 

\[ a(n) = \lfloor n/2 \rfloor + a(\lfloor n/2 \rfloor). \]

Let now $(x(n))_n$ denote the sequence of partial sums of $(a(n))_n$, which is 
sequence A174605 in Sloane's Encyclopedia. Then, the sequence $(x(n))_n$ starts 
with $x(0) = x(1) = 0$ and it satisfies the recurrence 

\[ x(n) - x(n-1) = a(n) = \lfloor n/2 \rfloor + a(\lfloor n/2 \rfloor) = \lfloor n/2 \rfloor + x(\lfloor n/2 \rfloor) - x(\lfloor n/2 \rfloor - 1). \]

We want to prove that $f(n+1) = x(n)$, for every $n \geq 0$. Since $f(1) = 
f(2) = 0$, it remains to check the equality 

\[ f(n+1) - f(n) = \lfloor n/2 \rfloor + f(\lfloor n/2 \rfloor + 1) - f(\lfloor n/2 \rfloor), \quad \text{for } n \geq 2. \]

We prove this equality with the help of Lemma 14 and by distinguishing four 
cases, depending on the residue of $n$ mod 4. 



If n = Am, then 



fin + 1) - fin) = /(2m + 1) + /(2m) + + (\") 

-(/(2m) + /(2m) + (t) + (t)) 

= /(2m + l)-/(2m) + f™+Vf2™) 
= /(2m, + 1) - f (2m) + 2m 

= fi\n/2\+l)~fi\nl2\)+[n/2\ 

If n = 4m, + 1 , then 

fin + 1) - fin) = /(2m + 1) + /(2m + 1) + f + CT'^ 



-(/(2m + l) + /(2m) + f™+i) + f™)) 
/(2m + 1)- /(2m) + (2"+^) -(2™) 
/(2m + 1) - /(2m) + 2m 
/(Ln/2J +l)-/(Ln/2j) + Ln/2j 
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• If n = 4m + 2, then 



f{n + 1) - fin) - /(2m + 2) + /(2m + 1) + (2™+') + 

-(/(2m + 1) + /(2m + 1) + + (2™+i)) 

= /(2m + 2) - /(2m + 1) + {''"\+^) - 
= /(2m + 2) - /(2m + 1) + 2m + 1 
-/(K2J+1)-/(K2J)+K2J 



• If n = 4m + 3, then 

fin + 1) - fin) = /(2m + 2) + /(2m + 2) + (2"+') + ('Y') 
-(/(2m + 2) + /(2m + 1) + + f 

= /(2m + 2) - /(2m + 1) + - 
= /(2m + 2) - /(2m + 1) + 2m + 1 
= /(Ln/2j+l)-/(Ln/2j)+Ln/2j 



This completes the proof. □ 

In particular, this yields a new meaning and a new recurrence for sequence 
A174605 in Sloane's Encyclopedia. 
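Proposition 15 lends itself to a quick numerical check. The sketch below is only an illustration with hypothetical function names: it evaluates $f$ through the recurrence of Lemma 14 and compares it with the partial sums of $a(k)$, where $a$ is computed through the recurrence $a(n) = \lfloor n/2 \rfloor + a(\lfloor n/2 \rfloor)$ quoted in the proof.

```python
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def f(n):
    """Minimum of Phi on binary trees with n leaves (recurrence of Lemma 14)."""
    if n <= 2:
        return 0
    hi, lo = (n + 1) // 2, n // 2        # ceil(n/2) and floor(n/2)
    return f(hi) + f(lo) + comb(hi, 2) + comb(lo, 2)

def a(n):
    """Exponent of the highest power of 2 dividing n!."""
    return 0 if n == 0 else n // 2 + a(n // 2)

def partial_sum(n):
    """a(0) + a(1) + ... + a(n-1), the right-hand side of Proposition 15."""
    return sum(a(k) for k in range(n))

if __name__ == "__main__":
    for n in range(1, 11):
        print(n, f(n), partial_sum(n))
```

Both printed columns agree for every n tried, which is consistent with the proposition but of course does not replace the proof.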



5. Expected value of $\Phi$ under the Yule model 

Let $\Phi_n$ be the random variable that chooses a tree $T \in \mathcal{BT}_n$ and computes its 
total cophenetic index $\Phi(T)$. In this section we determine the expected value of 
$\Phi_n$ under the Yule model. To do this, we shall make use of the following lemma, 
which can be useful to study the expected value under the Yule model of other 
binary recursive tree shape statistics in the sense of [11]. 

Lemma 16. Let $I$ be a mapping that associates to each phylogenetic tree $T$ a real 
number $I(T) \in \mathbb{R}$ satisfying the following two conditions: 

(a) It is invariant under tree isomorphisms and relabelings of leaves. 

(b) There exists a mapping $f : \mathbb{N} \times \mathbb{N} \to \mathbb{R}$ such that, for every pair of phylogenetic 
trees $T, T'$ on disjoint sets of taxa $S, S'$, respectively, 

\[ I(T \star T') = I(T) + I(T') + f(|S|, |S'|), \]

where $T \star T'$ denotes the tree obtained by connecting the roots of $T$ and $T'$ to a new common root. 

For every $n \geq 1$, let $I_n$ be the random variable that chooses a tree $T \in \mathcal{BT}_n$ 
and computes $I(T)$, and let $E_Y(I_n)$ be its expected value under the Yule model. 
Then, 

\[ E_Y(I_n) = \frac{1}{n-1} \left( 2 \sum_{k=1}^{n-1} E_Y(I_k) + \sum_{k=1}^{n-1} f(k, n-k) \right). \]
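Proof. First of all, notice that if $T_k \in \mathcal{BT}(S_k)$, with $S_k \subsetneq \{1, \ldots, n\}$ and $|S_k| = k$, 
and $T'_{n-k} \in \mathcal{BT}(\{1, \ldots, n\} \setminus S_k)$, where $\mathcal{BT}(S)$ denotes the set of binary phylogenetic trees on $S$, then 

\[ P_Y(T_k \star T'_{n-k}) = \frac{2}{(n-1)\binom{n}{k}} P_Y(T_k) P_Y(T'_{n-k}), \]

where $P_Y$ denotes the probability of a phylogenetic tree under the Yule model. 
This assertion is a direct consequence of the explicit probabilities of $T_k$, $T'_{n-k}$ and 
$T_k \star T'_{n-k}$ under the Yule model given in §2.2, and the fact that $V_{int}(T_k \star T'_{n-k}) = 
V_{int}(T_k) \cup V_{int}(T'_{n-k}) \cup \{r\}$ (where $r$ denotes the root of $T_k \star T'_{n-k}$), these unions 
being disjoint. 

Let us now compute $E_Y(I_n)$ using its very definition. In the second line, the factor $1/2$ 
appears because each tree is obtained twice, once for each ordering of the two subtrees 
rooted at the children of its root: 

\[ E_Y(I_n) = \sum_{T \in \mathcal{BT}_n} I(T) \cdot P_Y(T) \]
\[ = \frac{1}{2} \sum_{k=1}^{n-1} \sum_{\substack{S_k \subsetneq \{1, \ldots, n\} \\ |S_k| = k}} \sum_{T_k \in \mathcal{BT}(S_k)} \sum_{T'_{n-k} \in \mathcal{BT}(S_k^c)} I(T_k \star T'_{n-k}) \cdot P_Y(T_k \star T'_{n-k}) \]
\[ = \frac{1}{2} \sum_{k=1}^{n-1} \binom{n}{k} \sum_{T_k \in \mathcal{BT}_k} \sum_{T'_{n-k} \in \mathcal{BT}_{n-k}} \big( I(T_k) + I(T'_{n-k}) + f(k, n-k) \big) \cdot \frac{2}{(n-1)\binom{n}{k}} P_Y(T_k) P_Y(T'_{n-k}) \]
\[ = \frac{1}{n-1} \sum_{k=1}^{n-1} \sum_{T_k} \sum_{T'_{n-k}} \big( I(T_k) + I(T'_{n-k}) + f(k, n-k) \big) P_Y(T_k) P_Y(T'_{n-k}) \]
\[ = \frac{1}{n-1} \sum_{k=1}^{n-1} \Big( \sum_{T_k} \sum_{T'_{n-k}} I(T_k) P_Y(T_k) P_Y(T'_{n-k}) + \sum_{T_k} \sum_{T'_{n-k}} I(T'_{n-k}) P_Y(T_k) P_Y(T'_{n-k}) + \sum_{T_k} \sum_{T'_{n-k}} f(k, n-k) P_Y(T_k) P_Y(T'_{n-k}) \Big) \]
\[ = \frac{1}{n-1} \sum_{k=1}^{n-1} \Big( \sum_{T_k} I(T_k) P_Y(T_k) + \sum_{T'_{n-k}} I(T'_{n-k}) P_Y(T'_{n-k}) + f(k, n-k) \Big) \]
\[ = \frac{1}{n-1} \sum_{k=1}^{n-1} \big( E_Y(I_k) + E_Y(I_{n-k}) + f(k, n-k) \big) \]
\[ = \frac{1}{n-1} \Big( 2 \sum_{k=1}^{n-1} E_Y(I_k) + \sum_{k=1}^{n-1} f(k, n-k) \Big). \]

□ 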

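Theorem 17. Under the Yule model, the expected value of $\Phi_n$ is 

\[ E_Y(\Phi_n) = n(n-1) - 2n \sum_{i=2}^{n} \frac{1}{i}. \]

Proof. Lemma 4 implies that $\Phi$ satisfies the hypothesis of Lemma 16 with 
$f(k, n-k) = \binom{k}{2} + \binom{n-k}{2}$. Therefore 

\[ E_Y(\Phi_n) = \frac{1}{n-1} \left( 2 \sum_{k=1}^{n-1} E_Y(\Phi_k) + \sum_{k=1}^{n-1} \left( \binom{k}{2} + \binom{n-k}{2} \right) \right) = \frac{2}{n-1} \sum_{k=1}^{n-1} \left( E_Y(\Phi_k) + \binom{k}{2} \right), \]

and hence, applying this identity to $n-1$, 

\[ \sum_{k=1}^{n-2} \left( E_Y(\Phi_k) + \binom{k}{2} \right) = \frac{n-2}{2} E_Y(\Phi_{n-1}). \]

Then, 

\[ E_Y(\Phi_n) = \frac{2}{n-1} \left( E_Y(\Phi_{n-1}) + \binom{n-1}{2} \right) + \frac{2}{n-1} \sum_{k=1}^{n-2} \left( E_Y(\Phi_k) + \binom{k}{2} \right) \]
\[ = \frac{2}{n-1} E_Y(\Phi_{n-1}) + (n-2) + \frac{n-2}{n-1} E_Y(\Phi_{n-1}) = \frac{n}{n-1} E_Y(\Phi_{n-1}) + n - 2. \]

To solve this equation, rewrite it as 

\[ \frac{1}{n} E_Y(\Phi_n) = \frac{1}{n-1} E_Y(\Phi_{n-1}) + \frac{n-2}{n}. \]

Setting $x_n = E_Y(\Phi_n)/n$, the sequence $(x_n)_n$ satisfies 

\[ x_n = x_{n-1} + \frac{n-2}{n}, \quad \text{starting with } x_2 = 0. \]

Therefore 

\[ x_n = \sum_{i=3}^{n} \frac{i-2}{i} = (n-2) - 2 \sum_{i=3}^{n} \frac{1}{i} = (n-1) - 2 \sum_{i=2}^{n} \frac{1}{i} \]

and thus, finally, 

\[ E_Y(\Phi_n) = n x_n = n(n-1) - 2n \sum_{i=2}^{n} \frac{1}{i}. \]

□ 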



Let $S_n$ stand for the random variable that chooses a tree $T_n \in \mathcal{BT}_n$ and computes 
its Sackin index $S(T_n)$; cf. §2.2. Notice that, since $E_Y(S_n) = 2n \sum_{j=2}^{n} 1/j$ 
[10], we have that 

\[ E_Y(\Phi_n) = n(n-1) - E_Y(S_n). \]

We have not been able to find a direct reason for this equality. 
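The closed formula of Theorem 17 can be checked against the recurrence of Lemma 16 with exact rational arithmetic. The following is a small sketch with hypothetical function names; it takes $f(k, n-k) = \binom{k}{2} + \binom{n-k}{2}$ as in the proof, and it also checks the relation with the Yule expectation of the Sackin index quoted in the text.

```python
from fractions import Fraction
from math import comb

def yule_expected_phi(n_max):
    """E_Y(Phi_n) for n = 1..n_max from the recurrence of Lemma 16."""
    expected = [Fraction(0)] * (n_max + 1)   # expected[1] = 0; index 0 is unused
    for n in range(2, n_max + 1):
        total = sum(2 * expected[k] + comb(k, 2) + comb(n - k, 2)
                    for k in range(1, n))
        expected[n] = total / (n - 1)
    return expected

def harmonic_tail(n):
    """Sum of 1/i for i = 2..n, as an exact fraction."""
    return sum((Fraction(1, i) for i in range(2, n + 1)), Fraction(0))

def closed_form_phi(n):
    """Right-hand side of Theorem 17."""
    return n * (n - 1) - 2 * n * harmonic_tail(n)

def yule_expected_sackin(n):
    """Yule expectation of the Sackin index, as quoted in the text."""
    return 2 * n * harmonic_tail(n)

if __name__ == "__main__":
    by_recurrence = yule_expected_phi(8)
    for n in range(2, 9):
        print(n, by_recurrence[n], closed_form_phi(n))
```

Using `Fraction` avoids any floating-point doubt about whether the recurrence and the closed form really coincide.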
Corollary 18. $E_Y(\Phi_n) = n^2 + (1 - 2\gamma) n - 2n \ln(n) + o(n)$, where $\gamma$ is Euler's constant. 

So, the order $O(n^2)$ of the expected value under the Yule model of the total 
cophenetic index on $\mathcal{BT}_n$ is larger than the order $O(n \log(n))$ of the expected 
values of Sackin's and Colless' indices. 

From the expected values of the Sackin and the total cophenetic indices, 
we can deduce the expected value of the total area $D$ on $\mathcal{BT}_n$ under the Yule 
model. 

Corollary 19. Let $D_n$ be the random variable that chooses a tree $T \in \mathcal{BT}_n$ and 
computes its total area $D(T)$. Under the Yule model, its expected value is 

\[ E_Y(D_n) = 2n(n+1) \sum_{i=2}^{n} \frac{1}{i} - 2n(n-1). \]

Proof. From Lemma 6 we deduce that 

\[ 2\Phi_n + D_n = (n-1) S_n, \]

and therefore 

\[ E_Y(D_n) = (n-1) E_Y(S_n) - 2 E_Y(\Phi_n) = 2n(n+1) \sum_{i=2}^{n} \frac{1}{i} - 2n(n-1). \]

□ 

Remark 20. In [15, p. 143, eq. (35)], it is claimed that 

\[ E_Y(D_n) = 2n(n+1) \sum_{i=2}^{n} \frac{1}{i} - \frac{5}{2} n(n-1), \]

which cannot be correct: since all three trees $T \in \mathcal{BT}_3$ have $D(T) = 8$, it 
must happen that $E_Y(D_3) = 8$, while the expression given in loc. cit. yields 
$E_Y(D_3) = 5$. And incidentally, our formula does yield the correct value in this 
case. 
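For small n, the expected total area under the Yule model can also be obtained by brute force, which gives an independent check of the value $E_Y(D_3) = 8$ discussed above. The sketch below rests on assumptions of ours: trees are nested tuples, all function names are hypothetical, the trees in $\mathcal{BT}_n$ are enumerated by attaching leaf $n$ to every edge (or above the root) of each tree with $n-1$ leaves, and the Yule probability of a tree is taken to be $2^{n-1}/n!$ times the product of $1/(\kappa_T(v)-1)$ over its internal nodes.

```python
from fractions import Fraction
from itertools import combinations
from math import factorial

def insert_leaf(tree, leaf):
    """Yield every binary tree obtained by attaching `leaf` to `tree`."""
    yield (tree, leaf)                    # on the edge above this node (or a new root)
    if isinstance(tree, tuple):
        left, right = tree
        for t in insert_leaf(left, leaf):
            yield (t, right)
        for t in insert_leaf(right, leaf):
            yield (left, t)

def binary_trees(n):
    """All binary phylogenetic trees on leaves 1..n; there are (2n-3)!! of them."""
    trees = [1]
    for leaf in range(2, n + 1):
        trees = [t for old in trees for t in insert_leaf(old, leaf)]
    return trees

def yule_probability(tree, n):
    """Assumed closed form: 2^(n-1)/n! * prod over internal v of 1/(kappa(v)-1)."""
    prob = Fraction(2 ** (n - 1), factorial(n))

    def visit(node):
        nonlocal prob
        if not isinstance(node, tuple):
            return 1
        kappa = sum(visit(child) for child in node)
        prob /= (kappa - 1)
        return kappa

    visit(tree)
    return prob

def total_area(tree):
    """D(T): sum of nodal distances over all pairs of leaves."""
    paths = {}

    def walk(node, path):
        if not isinstance(node, tuple):
            paths[node] = path
        else:
            for index, child in enumerate(node):
                walk(child, path + (index,))

    walk(tree, ())

    def common_prefix(p, q):
        length = 0
        for a, b in zip(p, q):
            if a != b:
                break
            length += 1
        return length

    return sum(len(paths[i]) + len(paths[j]) - 2 * common_prefix(paths[i], paths[j])
               for i, j in combinations(paths, 2))

def expected_area_yule(n):
    return sum(yule_probability(t, n) * total_area(t) for t in binary_trees(n))

def corollary_19(n):
    tail = sum((Fraction(1, i) for i in range(2, n + 1)), Fraction(0))
    return 2 * n * (n + 1) * tail - 2 * n * (n - 1)

if __name__ == "__main__":
    for n in range(2, 7):
        print(n, expected_area_yule(n), corollary_19(n))
```

The enumeration grows like $(2n-3)!!$, so this check is only practical for very small n.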



6. Expected value of $\Phi$ under the uniform model 

In this section we determine the expected value of $\Phi_n$ under the uniform 
model. This expected value of $\Phi_n$ will be easily deduced, through Lemma 6, 
from the expected value of the total area, which was obtained in [12], and the 
expected value of the Sackin index, which we obtain in Theorem 22 below. This 
last formula is, to our knowledge, new. 

Since, under the uniform model, all trees in $\mathcal{BT}_n$ have the same probability, 
$1/(2n-3)!!$, the expected value of $S_n$ under the uniform model is 

\[ E_U(S_n) = \frac{\sum_{T \in \mathcal{BT}_n} S(T)}{(2n-3)!!}. \]

So, we need to compute the numerator in this fraction. 
Lemma 21. For every $n \geq 2$, 

\[ \sum_{T \in \mathcal{BT}_n} S(T) = n \sum_{k=1}^{n-1} \frac{(2n-k-3)!\, k^2}{(n-k-1)!\, 2^{n-k-1}}. \]

Proof. For every $k = 1, \ldots, n-1$, let 

\[ c_{k,n} = |\{T \in \mathcal{BT}_n \mid \delta_T(1) = k\}| = |\{T \in \mathcal{BT}_n \mid \delta_T(i) = k\}| \quad \text{for every } 1 \leq i \leq n. \]

Then 

\[ \sum_{T \in \mathcal{BT}_n} S(T) = \sum_{T \in \mathcal{BT}_n} \sum_{i=1}^{n} \delta_T(i) = \sum_{i=1}^{n} \sum_{T \in \mathcal{BT}_n} \delta_T(i) = \sum_{i=1}^{n} \sum_{k=1}^{n-1} k \cdot |\{T \in \mathcal{BT}_n \mid \delta_T(i) = k\}| = n \sum_{k=1}^{n-1} k \cdot c_{k,n}. \]

It remains to compute $c_{k,n}$ for $k \geq 1$. To do so, notice that every tree $T \in \mathcal{BT}_n$ 
such that $\delta_T(1) = k$ will have the form described in Fig. 11. Therefore, it is 
determined by the ordered $k$-forest $T_1, T_2, \ldots, T_k$ on $\{2, \ldots, n\}$, and thus 

\[ c_{k,n} = |\mathcal{BF}_{k,n-1}| = \frac{(2n-k-3)!\, k}{(n-k-1)!\, 2^{n-k-1}}, \]

from which the expression in the statement follows. □ 

Figure 11: The structure of a tree $T$ with $\delta_T(1) = k$. 

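Now, recall that the (generalized) hypergeometric function ${}_pF_q$ is defined [1] 
as 

\[ {}_pF_q\!\left( \begin{matrix} a_1, \ldots, a_p \\ b_1, \ldots, b_q \end{matrix} ;\ z \right) = \sum_{k \geq 0} \frac{(a_1)_k \cdots (a_p)_k}{(b_1)_k \cdots (b_q)_k} \cdot \frac{z^k}{k!}, \]

where $(a)_k := a \cdot (a+1) \cdots (a+k-1)$. Many popular software systems, like 
Mathematica or R, have implementations of these functions. 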
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Theorem 22. The expected value of the random variable Sn under the uniform 
model is 

n ^ / 2, 2, 2 - n 



^^^^"^ " 2^^^^^ h 4'-2n '2 
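Proof. By the last lemma, we have that

$$E_U(S_n) = \frac{\sum_{T\in\mathcal{BT}_n} S(T)}{(2n-3)!!} = \frac{n}{(2n-3)!!}\sum_{k=1}^{n-1}\frac{(2n-k-3)!\,k^2}{(n-k-1)!\,2^{n-k-1}}.$$

Now

$$\begin{aligned}
\frac{n k^2 (2n-k-3)!}{(2n-3)!!\,(n-k-1)!\,2^{n-k-1}} &= \frac{n k^2 (2n-k-3)!\,2^{n-2}(n-2)!}{(2n-3)!\,(n-k-1)!\,2^{n-k-1}}\\
&= \frac{n k^2 (2n-k-3)!\,2^{n-2}(n-2)!\,k!}{(2n-3)!\,(n-k-1)!\,2^{n-k-1}\,k!}\\
&= \frac{n k^2 2^{k-1}\binom{n-1}{k}}{(n-1)\binom{2n-3}{k}}
\end{aligned}$$

and thus

$$\begin{aligned}
E_U(S_n) &= \frac{n}{n-1}\sum_{k=1}^{n-1}\frac{k^2 2^{k-1}\binom{n-1}{k}}{\binom{2n-3}{k}}\\
&= \frac{n}{n-1}\sum_{k=1}^{n-1}\frac{k^2 2^{k-1}(n-1)(n-2)(n-3)\cdots(n-k)}{(2n-3)(2n-4)(2n-5)\cdots(2n-k-2)}\\
&= \frac{n}{2n-3}\sum_{k=1}^{n-1}\frac{k^2 2^{k-1}(n-2)(n-3)\cdots(n-k)}{(2n-4)(2n-5)\cdots(2n-k-2)}\\
&= \frac{n}{2n-3}\sum_{k=1}^{n-1}\frac{k^2 2^{k-1}(2-n)(2-n+1)\cdots(-n+k)}{(4-2n)(4-2n+1)\cdots(2-2n+k)}\\
&= \frac{n}{2n-3}\sum_{k=0}^{n-2}\frac{(k+1)^2 2^{k}(2-n)(2-n+1)\cdots(1-n+k)}{(4-2n)(4-2n+1)\cdots(3-2n+k)}\\
&= \frac{n}{2n-3}\sum_{k=0}^{n-2}\frac{((k+1)!)^2(2-n)(2-n+1)\cdots(1-n+k)\cdot 2^{k}}{(k!)^2(4-2n)(4-2n+1)\cdots(3-2n+k)}\\
&= \frac{n}{2n-3}\sum_{k=0}^{n-2}\frac{(2)_k(2)_k(2-n)_k}{(1)_k(4-2n)_k}\cdot\frac{2^k}{k!} = \frac{n}{2n-3}\;{}_3F_2\left(\begin{array}{c} 2,\ 2,\ 2-n\\ 1,\ 4-2n\end{array};\, 2\right),
\end{aligned}$$

as we claimed. □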
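Since $(2-n)_k = 0$ for every $k \geq n-1$, the hypergeometric series in Theorem 22 is a finite sum and can be evaluated exactly with rational arithmetic. The following Python sketch, which is not part of the original paper and whose function names are hypothetical, sums the series term by term and compares the result with the value of $E_U(S_n)$ obtained directly from Lemma 21.

```python
from fractions import Fraction
from math import factorial

def pochhammer(a, k):
    """Rising factorial (a)_k = a (a+1) ... (a+k-1), with (a)_0 = 1."""
    result = Fraction(1)
    for j in range(k):
        result *= a + j
    return result

def hyp3f2_terminating(n):
    """3F2(2, 2, 2-n; 1, 4-2n; 2), summed over k = 0..n-2 (later terms vanish)."""
    total = Fraction(0)
    for k in range(n - 1):
        num = pochhammer(2, k) ** 2 * pochhammer(2 - n, k)
        den = pochhammer(1, k) * pochhammer(4 - 2 * n, k) * factorial(k)
        total += num / den * Fraction(2) ** k
    return total

def double_factorial(m):
    """m!! = m (m-2) (m-4) ... down to 1 or 2."""
    result = 1
    while m > 1:
        result *= m
        m -= 2
    return result

def expected_sackin_theorem22(n):
    """E_U(S_n) = n/(2n-3) * 3F2(2, 2, 2-n; 1, 4-2n; 2), for n >= 3."""
    return Fraction(n, 2 * n - 3) * hyp3f2_terminating(n)

def expected_sackin_lemma21(n):
    """E_U(S_n) as the sum in Lemma 21 divided by (2n-3)!!."""
    total = sum(
        Fraction(factorial(2 * n - k - 3) * k * k,
                 factorial(n - k - 1) * 2 ** (n - k - 1))
        for k in range(1, n)
    )
    return n * total / double_factorial(2 * n - 3)

for n in (3, 4, 5, 10):
    print(n, expected_sackin_theorem22(n), expected_sackin_lemma21(n))
```

Exact fractions are used on purpose: the series alternates in sign and its terms grow quickly, so a floating-point evaluation could lose accuracy for large $n$.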

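We have now the following result.

Theorem 23. Under the uniform model, the expected value of $\Phi_n$ is

$$E_U(\Phi_n) = \binom{n}{2}\left(\frac{1}{2n-3}\;{}_3F_2\left(\begin{array}{c} 2,\ 2,\ 2-n\\ 1,\ 4-2n\end{array};\, 2\right) - \frac{1}{2}\cdot\frac{(2n-2)!!}{(2n-3)!!}\right) \sim \frac{\sqrt{\pi}}{4}\,n^{5/2}.$$

Proof. The expected values under the uniform model of $S_n$ and $D_n$ are:

$$E_U(S_n) = \frac{n}{2n-3}\;{}_3F_2\left(\begin{array}{c} 2,\ 2,\ 2-n\\ 1,\ 4-2n\end{array};\, 2\right) \quad\text{by Theorem 22,}$$

$$E_U(D_n) = \binom{n}{2}\frac{(2n-2)!!}{(2n-3)!!} \quad\text{by [12].}$$

Then, by Lemma 6,

$$\begin{aligned}
E_U(\Phi_n) &= \frac{n-1}{2}\,E_U(S_n) - \frac{1}{2}\,E_U(D_n)\\
&= \frac{n-1}{2}\cdot\frac{n}{2n-3}\;{}_3F_2\left(\begin{array}{c} 2,\ 2,\ 2-n\\ 1,\ 4-2n\end{array};\, 2\right) - \frac{1}{2}\binom{n}{2}\frac{(2n-2)!!}{(2n-3)!!}\\
&= \binom{n}{2}\left(\frac{1}{2n-3}\;{}_3F_2\left(\begin{array}{c} 2,\ 2,\ 2-n\\ 1,\ 4-2n\end{array};\, 2\right) - \frac{1}{2}\cdot\frac{(2n-2)!!}{(2n-3)!!}\right).
\end{aligned}$$

The assertion $E_U(\Phi_n) \sim \frac{\sqrt{\pi}}{4}\,n^{5/2}$ comes easily from Lemma 6 and the facts that $E_U(S_n) \sim \sqrt{\pi}\,n^{3/2}$ [3], and that, using Stirling's approximation for large factorials, $E_U(D_n) \sim \frac{\sqrt{\pi}}{2}\,n^{5/2}$ [12]. □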
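As a sanity check on the expected value of the total cophenetic index under the uniform model, the Python sketch below (not part of the original paper) evaluates the expression $\binom{n}{2}\bigl(\frac{1}{2n-3}\,{}_3F_2(2,2,2-n;1,4-2n;2) - \frac{1}{2}\frac{(2n-2)!!}{(2n-3)!!}\bigr)$ exactly, compares it for small $n$ with the mean of $\Phi$ over all binary phylogenetic trees generated by brute force, and prints its ratio to $\frac{\sqrt{\pi}}{4}n^{5/2}$. The tree encoding as nested pairs and all function names are assumptions of this sketch.

```python
from fractions import Fraction
from math import comb, factorial, pi, sqrt

def insertions(tree, leaf):
    """Yield every tree obtained by attaching `leaf` to one edge of `tree`."""
    yield (tree, leaf)
    if isinstance(tree, tuple):
        left, right = tree
        for new_left in insertions(left, leaf):
            yield (new_left, right)
        for new_right in insertions(right, leaf):
            yield (left, new_right)

def binary_trees(n):
    """List all binary phylogenetic trees on the leaves 1..n."""
    trees = [1]
    for leaf in range(2, n + 1):
        trees = [new for old in trees for new in insertions(old, leaf)]
    return trees

def total_cophenetic(tree, depth=0):
    """Return (Phi, number of leaves) of the subtree whose root has depth `depth`."""
    if not isinstance(tree, tuple):
        return 0, 1
    phi_l, n_l = total_cophenetic(tree[0], depth + 1)
    phi_r, n_r = total_cophenetic(tree[1], depth + 1)
    # n_l * n_r pairs of leaves have this node as least common ancestor
    return phi_l + phi_r + depth * n_l * n_r, n_l + n_r

def pochhammer(a, k):
    result = Fraction(1)
    for j in range(k):
        result *= a + j
    return result

def hyp3f2_terminating(n):
    """3F2(2, 2, 2-n; 1, 4-2n; 2) as a finite sum over k = 0..n-2."""
    return sum(
        pochhammer(2, k) ** 2 * pochhammer(2 - n, k) * Fraction(2) ** k
        / (pochhammer(1, k) * pochhammer(4 - 2 * n, k) * factorial(k))
        for k in range(n - 1)
    )

def double_factorial(m):
    result = 1
    while m > 1:
        result *= m
        m -= 2
    return result

def expected_phi_uniform(n):
    """Closed expression for E_U(Phi_n), valid for n >= 3."""
    ratio = Fraction(double_factorial(2 * n - 2), double_factorial(2 * n - 3))
    return comb(n, 2) * (hyp3f2_terminating(n) / (2 * n - 3) - ratio / 2)

def mean_phi_brute_force(n):
    trees = binary_trees(n)
    return Fraction(sum(total_cophenetic(t)[0] for t in trees), len(trees))

for n in (3, 4, 5, 6):
    print(n, expected_phi_uniform(n), mean_phi_brute_force(n))
for n in (50, 200, 400):
    print(n, float(expected_phi_uniform(n)) / (sqrt(pi) / 4 * n ** 2.5))
```

The printed ratios approach 1 only slowly, because the first correction to the leading term is of relative order $n^{-1/2}$.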



7. Discussion and conclusions 



In this paper we have introduced a new balance index for phylogenetic trees, the total cophenetic index $\Phi$. This index makes sense for arbitrary phylogenetic trees, it can be computed in linear time, and it has a larger range of values than Sackin's or Colless' indices. We have computed its maximum and minimum values for binary and arbitrary phylogenetic trees, and its expected value under the Yule and the uniform models. In future work we plan to study other statistical properties of $\Phi$, like its variance, its limiting distribution or its correlation to other balance indices.

From the point of view of the measurement of the degree of symmetry of a tree, our index outperforms the resolution power of Sackin's and Colless' indices. We already saw some hints of this property in the previous sections: for instance, in Theorem 13, where we proved that the only trees $T \in \mathcal{BT}_n$ that have minimum $\Phi(T)$ are the maximally balanced ones, something that is not true in general for Sackin's and Colless' indices (recall Fig. [s]); or in the lemmas previous to the proof of this theorem, where we saw that any interchange of cousins that modifies the balance of their grandparent also modifies the value of $\Phi$. As further evidence of this greater resolution power, we have estimated the probability that a pair of trees $T_1, T_2 \in \mathcal{BT}_n$ have $I(T_1) = I(T_2)$, for $I = C, S, \Phi$. To do so, for every n = 2, ... , 10^ we have chosen randomly a number $N$ of pairs of trees in $\mathcal{BT}_n$ (for the first few values of $n$, $N$ was taken to be $|\mathcal{BT}_n|$, but starting at $n = 8$, we took $N = 3000$), and computed, for $I = C, S, \Phi$,

$$p_n(I) = \frac{\text{number of pairs } (T_1,T_2) \text{ with } n \text{ leaves such that } I(T_1) = I(T_2)}{N}.$$

Fig. 12 summarizes the results. It plots $\log(p_n(I))$ for the three balance indices as a function of $\log(n)$. We can see that the total cophenetic index has the lowest such estimated probability of a tie. We plan to perform a deeper study of the probability of ties for the different balance indices in a future paper.
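An experiment of this kind can be sketched in a few lines of Python. The code below is only an illustration, not the authors' implementation: it assumes that the random trees are drawn uniformly from $\mathcal{BT}_n$, which it does by attaching each new leaf to an edge chosen uniformly at random (the edge above the root included), and it estimates the tie frequency for the Colless, Sackin and total cophenetic indices from a given number of random pairs. The function names, the seed handling and the nested-pair encoding of trees are hypothetical.

```python
import random

def insertions(tree, leaf):
    """Yield every tree obtained by attaching `leaf` to one edge of `tree`."""
    yield (tree, leaf)
    if isinstance(tree, tuple):
        left, right = tree
        for new_left in insertions(left, leaf):
            yield (new_left, right)
        for new_right in insertions(right, leaf):
            yield (left, new_right)

def random_tree(n, rng):
    """Uniform random binary phylogenetic tree on the leaves 1..n:
    every sequence of edge choices is equally likely and yields a distinct tree."""
    tree = 1
    for leaf in range(2, n + 1):
        tree = rng.choice(list(insertions(tree, leaf)))
    return tree

def indices(tree, depth=0):
    """Return (Colless, Sackin, Phi, number of leaves) for the subtree at `depth`."""
    if not isinstance(tree, tuple):
        return 0, depth, 0, 1
    c_l, s_l, p_l, n_l = indices(tree[0], depth + 1)
    c_r, s_r, p_r, n_r = indices(tree[1], depth + 1)
    return (c_l + c_r + abs(n_l - n_r),          # Colless: imbalance of each node
            s_l + s_r,                           # Sackin: sum of leaf depths
            p_l + p_r + depth * n_l * n_r,       # Phi: depth of LCA over leaf pairs
            n_l + n_r)

def tie_frequencies(n, num_pairs, seed=0):
    """Fraction of random pairs (T1, T2) with I(T1) = I(T2), for I = C, S, Phi."""
    rng = random.Random(seed)
    ties = [0, 0, 0]
    for _ in range(num_pairs):
        first = indices(random_tree(n, rng))
        second = indices(random_tree(n, rng))
        for j in range(3):
            ties[j] += first[j] == second[j]
    names = ("Colless", "Sackin", "Phi")
    return {name: ties[j] / num_pairs for j, name in enumerate(names)}

for n in (4, 8, 16):
    print(n, tie_frequencies(n, num_pairs=3000))
```

For $n = 3$ there is a single tree shape, so every pair is a tie for all three indices. For $n = 4$ the two shapes differ in all three indices, so the three estimated frequencies coincide.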




Figure 12: Log-log plot of the estimated probability of a tie for three balance indices. 

This greater resolution power of $\Phi$ makes it a better candidate to be used to test evolutionary hypotheses. We have performed a preliminary such test on the TreeBASE database [14]. We have considered the numbers $n$ of leaves for which TreeBASE contains at least 20 binary phylogenetic trees with $n$ leaves, and for each such $n$ we have computed the mean of the total cophenetic indices of the corresponding binary trees. Fig. 13 plots the log of these means as a function of $\log(n)$. We have added the curves of the log of the expected values of $\Phi_n$ under the Yule distribution (lower curve) and under the uniform distribution (upper curve), again as a function of $\log(n)$. This figure shows that the total cophenetic indices of the binary phylogenetic trees in TreeBASE are better explained by the uniform model than by the Yule model. We also plan to report in a future paper on more extensive tests on stochastic models of evolutionary processes using the total cophenetic index.

Figure 13: Log-log plots of the mean of the total cophenetic index of the binary trees in TreeBASE with a fixed number $n$ of leaves, of $E_Y(\Phi_n)$ (lower curve) and $E_U(\Phi_n)$ (upper curve).



References 

[1] W. N. Bailey, Generalized Hypergeometric Series. Cambridge Tracts in Mathematics and Mathematical Physics 32, Stechert-Hafner Service Inc. (1964).

[2] M. G. B. Blum, O. François, On statistical tests of phylogenetic tree imbalance: The Sackin and other indices revisited. Mathematical Biosciences 195 (2005), 141-153.

[3] M. G. B. Blum, O. François, S. Janson, The mean, variance and limiting distribution of two statistics sensitive to phylogenetic tree balance. Ann. Appl. Probab. 16 (2006), 2195-2214.

[4] J. Brown, Probabilities of evolutionary trees. Syst. Biol. 43 (1994), 78-91.

[5] L. L. Cavalli-Sforza, A. Edwards, Phylogenetic analysis. Models and estimation procedures. Am. J. Hum. Genet. 19 (1967), 233-257.

[6] D. H. Colless, Review of "Phylogenetics: the theory and practice of phylogenetic systematics". Sys. Zool. 31 (1982), 100-104.

[7] J. Felsenstein, Inferring Phylogenies. Sinauer Associates Inc., 2004.

[8] E. Harding, The probabilities of rooted tree-shapes generated by random bifurcation. Adv. Appl. Prob. 3 (1971), 44-77.

[9] S. B. Heard, Patterns in Tree Balance among Cladistic, Phenetic, and Randomly Generated Phylogenetic Trees. Evolution 46 (1992), 1818-1826.



[10] M. Kirkpatrick, M. Slatkin, Searching for evolutionary patterns in the shape of a phylogenetic tree. Evolution 47 (1993), 1171-1181.

[11] F. Matsen, Optimization Over a Class of Tree Shape Statistics. IEEE/ACM Trans. Comput. Biol. Bioinformatics 4 (2007), 506-512.

[12] A. Mir, F. Rosselló, The mean value of the squared path-difference distance for rooted phylogenetic trees. Journal of Mathematical Analysis and Applications 371 (2010), 168-176.

[13] A. Mooers, S. B. Heard, Inferring evolutionary process from phylogenetic tree shape. Quart. Rev. Biol. 72 (1997), 31-54.

[14] V. Morell, TreeBASE: the roots of phylogeny. Science 273 (1996), 569-560. http://www.treebase.org

[15] W. H. Mulder, Probability distributions of ancestries and genealogical distances on stochastically generated rooted binary trees. J. Theor. Biol. 280 (2011), 139-145.

[16] J. S. Rogers, Response of tree imbalance to number of terminal taxa. Sys. Biol. 42 (1993), 102-105.

[17] J. S. Rogers, Central moments and probability distributions of Colless's coefficient of tree imbalance. Evolution 48 (1994), 2026-2036.

[18] J. S. Rogers, Central moments and probability distributions of three measures of phylogenetic tree imbalance. Sys. Biol. 45 (1996), 99-110.

[19] D. E. Rosen, Vicariant Patterns and Historical Explanation in Biogeography. Syst. Biol. 27 (1978), 159-188.

[20] M. J. Sackin, "Good" and "bad" phenograms. Sys. Zool. 21 (1972), 225-226.

[21] K. T. Shao, R. Sokal, Tree balance. Sys. Zool. 39 (1990), 226-276.

[22] R. Sokal, F. Rohlf, The Comparison of Dendrograms by Objective Methods. Taxon 11 (1962), 33-40.

[23] M. Steel, A. McKenzie, Distributions of cherries for two models of trees. Math. Biosc. 164 (2000), 81-92.

[24] M. Steel, A. McKenzie, Properties of phylogenetic trees generated by Yule-type speciation models. Math. Biosc. 170 (2001), 91-112.

[25] The On-Line Encyclopedia of Integer Sequences (2010), published electronically at http://oeis.org/.

[26] G. Valiente, Algorithms on Trees and Graphs. Springer-Verlag (2002).

[27] G. U. Yule, A mathematical theory of evolution based on the conclusions of Dr J. C. Willis. Phil. Trans. Royal Soc. (London) Series B 213 (1924), 21-87.






